For the final part of this project, we will extend our study of text classification. Using Principal Component Analysis and Truncated Singular Value Decomposition (methods for dimensionality reduction) we will attempt to replicate the same quality of modeling with a fraction of the features.
The outline of the procedure we are going to follow is:
Motivations
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats
import seaborn as sns
sns.set(rc={"axes.labelsize": 15});
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5;
plt.rcParams['axes.grid'] = True;
plt.gray();
In [3]:
#Reading the dataset in a dataframe using Pandas
df = pd.read_csv("data.csv")
#Print first observations
df.head()
Out[3]:
In [4]:
df.columns
Out[4]:
Our first collection of feature vectors will come from the Restaurant_Name column. We are still trying to predict whether a restaurant falls under the "pristine" category (Grade A, score greater than 90) or not. We could also try to see whether we could predict a restaurant's grade (A, B, C or F)
In [5]:
from sklearn.feature_extraction.text import CountVectorizer
# Turn the text documents into vectors
vectorizer = CountVectorizer(min_df=1, stop_words="english")
X = vectorizer.fit_transform(df['Restaurant_Name']).toarray()
y = df['Letter_Grade']
target_names = y.unique()
In [6]:
# Train/Test split and cross validation:
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size = 0.8)
In [7]:
X_train.shape
Out[7]:
Even though we do not have more features (3430) than rows of data (14888), we can still attempt to reduce the feature space by using Truncated SVD:
Once we have extracted a vector representation of the data, it's a good idea to project the data on the first 2D of a Singular Value Decomposition (i.e.. Principal Component Analysis) to get a feel of the data. Note that the TruncatedSVD class can accept scipy.sparse matrices as input (as an alternative to numpy arrays). We will use it to visualize the first two principal components of the vectorized dataset.
In [8]:
from sklearn.decomposition import TruncatedSVD
svd_two = TruncatedSVD(n_components=2, random_state=42)
X_train_svd = svd_two.fit_transform(X_train)
In [9]:
pc_df = pd.DataFrame(X_train_svd) # cast resulting matrix as a data frame
sns.pairplot(pc_df, diag_kind='kde');
In [10]:
# Percentage of variance explained for each component
def pca_summary(pca):
return pd.DataFrame([np.sqrt(pca.explained_variance_),
pca.explained_variance_ratio_,
pca.explained_variance_ratio_.cumsum()],
index = ["Standard deviation", "Proportion of Variance", "Cumulative Proportion"],
columns = (map("PC{}".format, range(1, len(pca.components_)+1))))
In [11]:
pca_summary(svd_two)
Out[11]:
In [12]:
# Only 3.5% of the variance is explained in the data
svd_two.explained_variance_ratio_.sum()
Out[12]:
In [13]:
from itertools import cycle
def plot_PCA_2D(data, target, target_names):
colors = cycle('rgbcmykw')
target_ids = range(len(target_names))
plt.figure()
for i, c, label in zip(target_ids, colors, target_names):
plt.scatter(data[target == i, 0], data[target == i, 1],
c=c, label=label)
plt.legend()
In [14]:
plot_PCA_2D(X_train_svd, y_train, target_names)
This must be the most uninformative plot in the history of plots. Obviously 2 principal components aren't enough. Let's try with 100:
In [15]:
# Now, let's try with 100 components to see how much it explains
svd_hundred = TruncatedSVD(n_components=100, random_state=42)
X_train_svd_hundred = svd_hundred.fit_transform(X_train)
In [16]:
# 43.7% of the variance is explained in the data for 100 dimensions
# This is mostly due to the High dimension of data and sparcity of the data
svd_hundred.explained_variance_ratio_.sum()
Out[16]:
In [17]:
plt.figure(figsize=(10, 7))
plt.bar(range(100), svd_hundred.explained_variance_)
Out[17]:
Is it worth it to keep adding dimensions? Recall that we started with a 3430-dimensional feature space which we have already reduced to 100 dimensions, and according to the graph above each dimension over the 100th one will be adding less than 0.5% in our explanation of the variance. Let us try once more with 300 dimensions, to see if we can get something respectably over 50% (so we can be sure we are doing better than a coin toss)
In [18]:
svd_sparta = TruncatedSVD(n_components=300, random_state=42)
X_train_svd_sparta = svd_sparta.fit_transform(X_train)
In [19]:
X_test_svd_sparta = svd_sparta.fit_transform(X_test)
In [20]:
svd_sparta.explained_variance_ratio_.sum()
Out[20]:
66.2% of the variance is explained through our model. This is quite respectable.
In [21]:
plt.figure(figsize=(10, 7))
plt.bar(range(300), svd_sparta.explained_variance_)
Out[21]:
In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB
# Fit a classifier on the training set
classifier = MultinomialNB().fit(np.absolute(X_train_svd_sparta), y_train)
print("Training score: {0:.1f}%".format(
classifier.score(X_train_svd_sparta, y_train) * 100))
# Evaluate the classifier on the testing set
print("Testing score: {0:.1f}%".format(
classifier.score(X_test_svd_sparta, y_test) * 100))
In [23]:
streets = df['Geocode'].apply(pd.Series)
In [24]:
streets = df['Geocode'].tolist()
In [25]:
split_streets = [i.split(' ', 1)[1] for i in streets]
In [26]:
split_streets = [i.split(' ', 1)[1] for i in split_streets]
In [27]:
split_streets = [i.split(' ', 1)[0] for i in split_streets]
In [28]:
split_streets[0]
Out[28]:
In [29]:
import re
shortword = re.compile(r'\W*\b\w{1,3}\b')
In [30]:
for i in range(len(split_streets)):
split_streets[i] = shortword.sub('', split_streets[i])
In [31]:
# Create a new column with the street:
df['Street_Words'] = split_streets
In [32]:
from sklearn.feature_extraction.text import CountVectorizer
# Turn the text documents into vectors
vectorizer = CountVectorizer(min_df=1, stop_words="english")
X = vectorizer.fit_transform(df['Street_Words']).toarray()
y = df['Letter_Grade']
target_names = y.unique()
In [33]:
# Train/Test split and cross validation:
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size = 0.8)
In [34]:
X_train.shape
Out[34]:
In [35]:
from sklearn.decomposition import TruncatedSVD
svd_two = TruncatedSVD(n_components=2, random_state=42)
X_train_svd = svd_two.fit_transform(X_train)
In [36]:
pc_df = pd.DataFrame(X_train_svd) # cast resulting matrix as a data frame
sns.pairplot(pc_df, diag_kind='kde');
In [37]:
pca_summary(svd_two)
Out[37]:
In [38]:
# 25% of the variance is explained in the data when we use only TWO principal components!
svd_two.explained_variance_ratio_.sum()
Out[38]:
In [44]:
# Now, let's try with 10 components to see how much it explains
svd_ten = TruncatedSVD(n_components=10, random_state=42)
X_train_svd_ten = svd_ten.fit_transform(X_train)
In [45]:
# 53.9% of the variance is explained in the data for 10 dimensions
# This is mostly due to the High dimension of data and sparcity of the data
svd_ten.explained_variance_ratio_.sum()
Out[45]: